feat: MarkdownHeaderSplitter by OGuggenbuehl · Pull Request #9660 · deepset-ai/haystack

OGuggenbuehl · 2025-07-29T13:55:52Z

Proposed Changes:

Implement MarkdownHeaderSplitter, which splits Markdown (.md) Documents based on their headers.

How did you test it?

unit tests

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant · 2025-07-29T13:55:59Z

All committers have signed the CLA.

sjrl · 2025-08-19T11:46:09Z

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to review further I'd appreciate it if you could add a set of tests like the ones we have for the DocumentSplitter: https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me review the actual splitting algorithm, since it's easier to understand with examples.

sjrl · 2025-09-18T06:37:07Z

Thanks for your continued work on this @OGuggenbuehl!

Some general comments. Could you:

Add a release note for this PR following the instructions here
Include our license header at the beginning of each file you've added. You can find an example of the license header here
Sign the CLA agreement (docs about it here) from this comment
If you haven't already, set up pre-commit hooks using pre-commit install. You can find more info about that in this section of our contribution guidelines.
Also, in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.

coveralls · 2025-09-19T15:44:23Z

Pull Request Test Coverage Report for Build 19816136481

Details

0 of 0 changed or added relevant lines in 0 files are covered.
3 unchanged lines in 1 file lost coverage.
Overall coverage remained the same at 92.189%

Files with Coverage Reduction	New Missed Lines	%
core/pipeline/async_pipeline.py	3	65.88%

Totals
Change from base Build 19775495543:	0.0%
Covered Lines:	14174
Relevant Lines:	15375

💛 - Coveralls

sjrl · 2026-01-29T13:33:38Z

haystack/components/preprocessors/markdown_header_splitter.py

+# SPDX-License-Identifier: Apache-2.0
+
+import re
+from typing import Literal, Optional


Apologies, we have moved to Python 3.10 type syntax since this PR was opened. If you could drop Optional and use | None instead, that would be great!

If you also update your branch with current main, the formatting scripts and tests should catch this change for you.

sjrl · 2026-01-29T13:44:12Z

haystack/components/preprocessors/markdown_header_splitter.py

+            current_page = doc.meta.get("page_number", 1) if doc.meta else 1
+            total_pages = doc.content.count(self.page_break_character) + 1
+            logger.debug(
+                "Processing page number: {current_page} out of {total_pages}",
+                current_page=current_page,
+                total_pages=total_pages,
+            )


Correct me if I'm wrong, but this doesn't sound quite right. The incoming document is usually a converted PDF file from a converter that hasn't been split yet, so page_number probably doesn't exist in the metadata yet.

Either way, I think the message "Processing page number: {current_page} out of {total_pages}" can be off for a few reasons. We are not just processing current_page; we are processing all the pages, right?

Also, you currently don't offset total_pages by current_page, so we could end up with a message like "Processing page number: 10 out of 2" if page_number from meta was 10 and this doc only had one page_break_character in it, right?
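The mismatch described above can be reproduced with a small standalone sketch. The first function mirrors the arithmetic from the diff; the second is only one hypothetical way to offset the range and is not the fix adopted in the PR. Both function names are made up for this example.

```python
def page_log_message(meta: dict, content: str, page_break_character: str = "\f") -> str:
    # Same arithmetic as the diff: the starting page comes from meta, while the
    # total is counted only from the page breaks inside this document.
    current_page = meta.get("page_number", 1) if meta else 1
    total_pages = content.count(page_break_character) + 1
    return f"Processing page number: {current_page} out of {total_pages}"


def page_range_message(meta: dict, content: str, page_break_character: str = "\f") -> str:
    # Hypothetical alternative: report the page range, offset by the starting page.
    first_page = meta.get("page_number", 1) if meta else 1
    last_page = first_page + content.count(page_break_character)
    return f"Processing pages {first_page} to {last_page}"
```

With meta equal to {"page_number": 10} and a single page break in the content, the first function produces the confusing "10 out of 2" message, while the second reports pages 10 to 11.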

sjrl · 2026-01-29T13:52:35Z

haystack/components/preprocessors/markdown_header_splitter.py

+            if not self.keep_headers and content.startswith("\n"):
+                content = content[1:]  # remove leading newline if headers not kept


I'd say let's drop this change and keep the leading newline character. If we want to clean up this kind of leading and trailing whitespace, we should use a DocumentCleaner after this splitter.

sjrl · 2026-01-29T14:00:40Z

haystack/components/preprocessors/markdown_header_splitter.py

+            # skip splits w/o content
+            if not content.strip():  # this strip is needed to avoid counting whitespace as content
+                # add as parent for subsequent headers
+                active_parents = [h for h in header_stack[: level - 1] if h is not None]
+                active_parents.append(header_text)
+                if self.keep_headers:


I think this can produce an unwanted edge case. If keep_headers is True, the headers are added to the content below on line 136, but we never get there when the content is empty because we hit the continue on line 127. So it seems to me that this if self.keep_headers: currently doesn't do anything, since once we reach here we always skip this match anyway.

sjrl · 2026-01-29T14:17:24Z

test/components/preprocessors/test_markdown_header_splitter.py

+
+    # Check that content is present and correct
+    # Test first split
+    header1_doc = split_docs[0]


Never mind, I see it's there already.

sjrl · 2026-01-29T14:20:18Z

test/components/preprocessors/test_markdown_header_splitter.py

+def test_page_break_handling_with_multiple_headers():
+    text = "# Header\nFirst page\f Second page\f Third page"


I like the name of the test! But I don't see multiple headers. Could we update the example to use multiple headers and sub-headers?

I think ideally you could make a copy of the sample_text fixture that also includes page breaks.

This one is hard to defend – I think this is an artifact of me reworking and merging tests. I'll make sure to make this one consistent with its name, my bad!
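As a rough idea of what such a fixture could look like, here is a sketch of a sample text with several headers, a sub-header, and form-feed page breaks, plus a helper that reports the page each header starts on. The sample text, the header regex, and the helper name are all assumptions for illustration; they are not the fixture or the logic used in the PR.

```python
import re

# Hypothetical sample: two top-level headers, one sub-header, two page breaks.
sample_text_with_page_breaks = (
    "# Header 1\nFirst page\f\n"
    "## Header 1.1\nSecond page\f\n"
    "# Header 2\nThird page"
)


def header_pages(text: str, page_break: str = "\f") -> list[tuple[str, int]]:
    """Return (header text, 1-based page number) for each Markdown header."""
    pages = []
    for match in re.finditer(r"^(#{1,6}) (.+)$", text, flags=re.MULTILINE):
        # The page number is one more than the number of page breaks before the header.
        page = text.count(page_break, 0, match.start()) + 1
        pages.append((match.group(2), page))
    return pages
```

A test built on a text like this can assert an exact page number for every header instead of covering only a single one.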

sjrl · 2026-01-29T14:22:40Z

test/components/preprocessors/test_markdown_header_splitter.py

+    assert subheader123_doc.content == "Content under header 1.2.3."
+
+
+def test_split_parentheaders(sample_text):


I think we can remove this test now since we test the parent_headers in test_split_without_headers and test_basic_split

sjrl · 2026-01-29T14:24:33Z

test/components/preprocessors/test_markdown_header_splitter.py

+    headers = {doc.meta["header"] for doc in split_docs}
+    assert {"Another Header", "H1", "H2"}.issubset(headers)


Let's not use these vague checks. Let's explicitly check that the expected doc has the expected header, like so:

split_docs[X] == "Another Header"
split_docs[Y] == "H1"
...
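A small self-contained sketch of the difference between the two styles follows. FakeDoc is a hypothetical stand-in for a Haystack Document, and the contents and order of split_docs are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    # Hypothetical stand-in exposing only the attributes the checks need.
    content: str
    meta: dict = field(default_factory=dict)


split_docs = [
    FakeDoc("Intro text", {"header": "Another Header"}),
    FakeDoc("Body one", {"header": "H1"}),
    FakeDoc("Body two", {"header": "H2"}),
]

# Vague check: it still passes if the documents come back in the wrong order.
reversed_headers = {doc.meta["header"] for doc in reversed(split_docs)}
assert {"Another Header", "H1", "H2"}.issubset(reversed_headers)

# Explicit checks: each position is tied to the header expected there.
assert split_docs[0].meta["header"] == "Another Header"
assert split_docs[1].meta["header"] == "H1"
assert split_docs[2].meta["header"] == "H2"
```

The explicit form fails as soon as a header ends up on the wrong document, which the subset check cannot detect.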

sjrl · 2026-01-29T14:30:33Z

test/components/preprocessors/test_markdown_header_splitter.py

+    docs = [Document(content=text)]
+    result = splitter.run(documents=docs)
+    split_docs = result["documents"]
+    assert len(split_docs) == 24


Let's also run the same checks for this output:

for i in range(1, len(split_docs)):
    prev_doc = split_docs[i - 1]
    curr_doc = split_docs[i]
    if prev_doc.meta["header"] == curr_doc.meta["header"]:  # only check overlap within same header
        prev_words = prev_doc.content.split()
        curr_words = curr_doc.content.split()
        assert prev_words[-2:] == curr_words[:2]
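The loop above can be tried in isolation with stand-in objects. In this sketch, FakeDoc and the sample documents are hypothetical, and the overlap of two words is taken from the [-2:] and [:2] slices in the suggested check.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    # Hypothetical stand-in for a Document with content and meta.
    content: str
    meta: dict = field(default_factory=dict)


def assert_word_overlap(split_docs: list, overlap: int = 2) -> None:
    """Check that consecutive chunks under the same header share `overlap` words."""
    for i in range(1, len(split_docs)):
        prev_doc = split_docs[i - 1]
        curr_doc = split_docs[i]
        if prev_doc.meta["header"] == curr_doc.meta["header"]:  # only within the same header
            prev_words = prev_doc.content.split()
            curr_words = curr_doc.content.split()
            assert prev_words[-overlap:] == curr_words[:overlap]


docs = [
    FakeDoc("one two three four", {"header": "H1"}),
    FakeDoc("three four five six", {"header": "H1"}),
    FakeDoc("alpha beta gamma", {"header": "H2"}),  # new header, so no overlap is expected
]
assert_word_overlap(docs)
```

Wrapping the loop in a helper makes it easy to reuse the same check across several tests.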

sjrl · 2026-01-29T14:31:24Z

test/components/preprocessors/test_markdown_header_splitter.py

+    docs = [Document(content=text)]
+    result = splitter.run(documents=docs)
+    split_docs = result["documents"]
+    assert len(split_docs) == 21


Could we also add checks that the split_ids are as expected?

sjrl · 2026-01-29T14:32:28Z

test/components/preprocessors/test_markdown_header_splitter.py

+    assert len(split_docs[0].content.split()) == 4  # "# Header" + 2 words
+    assert len(split_docs[1].content.split()) == 3  # 3 words (split_length)
+    assert len(split_docs[2].content.split()) == 3  # 3 words (split_length)
+    assert len(split_docs[3].content.split()) == 2  # 2 words (meets threshold)


Let's also add split_id checks

Could we also update the assertions to be string comparisons instead of lengths? That would make it easier to see if something went wrong.
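To show why string comparisons are easier to debug than length checks, here is a sketch with a toy word splitter. split_by_words is a hypothetical helper, not the splitter under review, and it does not implement the threshold behaviour; the split_id meta key is assumed from the comments in this review.

```python
def split_by_words(text: str, split_length: int) -> list[dict]:
    """Toy splitter: chunk text into groups of `split_length` words with a split_id."""
    words = text.split()
    return [
        {"content": " ".join(words[start : start + split_length]), "meta": {"split_id": split_id}}
        for split_id, start in enumerate(range(0, len(words), split_length))
    ]


split_docs = split_by_words("a b c d e f g h", 3)

# Length checks only say how many words there are.
assert len(split_docs[0]["content"].split()) == 3

# String checks show exactly which words ended up where.
assert split_docs[0]["content"] == "a b c"
assert split_docs[1]["content"] == "d e f"
assert split_docs[2]["content"] == "g h"

# split_id checks confirm the chunks are numbered in order.
assert [doc["meta"]["split_id"] for doc in split_docs] == [0, 1, 2]
```

When a string assertion fails, the test output shows the actual chunk next to the expected one, which a word count cannot do.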

sjrl · 2026-01-29T14:32:38Z

test/components/preprocessors/test_markdown_header_splitter.py

+    assert len(split_docs) == 3
+    assert len(split_docs[0].content.split()) == 3  # 3 words
+    assert len(split_docs[1].content.split()) == 3  # 3 words
+    assert len(split_docs[2].content.split()) == 4  # 4 words (due to threshold, not possible to split 3-1)


Let's also add split_id checks here.

github-actions bot added the topic:tests and type:documentation labels Jul 29, 2025

OGuggenbuehl changed the title ~~Feature/md header splitter~~ feat:MarkdownHeaderSplitter Jul 29, 2025

sjrl self-assigned this Aug 19, 2025

sjrl reviewed Aug 19, 2025

sjrl changed the title ~~feat:MarkdownHeaderSplitter~~ feat: MarkdownHeaderSplitter Aug 27, 2025

OGuggenbuehl force-pushed the feature/md-header-splitter branch from 61a8396 to bcbbf9a Compare September 16, 2025 13:57

sjrl reviewed Sep 18, 2025

github-actions bot added the topic:CI label Sep 19, 2025

OGuggenbuehl marked this pull request as ready for review September 19, 2025 16:05

OGuggenbuehl requested review from a team as code owners September 19, 2025 16:05

OGuggenbuehl removed the request for review from a team September 19, 2025 16:05

OGuggenbuehl added 3 commits December 1, 2025 09:28

test cleanup

64ff6fb

test splits more explicitly

eb3e568

cleanup tests

ad155cc

minor commenting

OGuggenbuehl force-pushed the feature/md-header-splitter branch from c7264e6 to ad155cc Compare December 1, 2025 08:28

Merge branch 'main' into feature/md-header-splitter

1c3897c

sjrl reviewed Jan 29, 2026

Update haystack/components/preprocessors/markdown_header_splitter.py

6339f07

Co-authored-by: Sebastian Husch Lee <[email protected]>

auto-merge was automatically disabled January 30, 2026 13:31
Head branch was pushed to by a user without write access

OGuggenbuehl and others added 5 commits January 30, 2026 14:31

Update haystack/components/preprocessors/markdown_header_splitter.py

b2455c0

Co-authored-by: Sebastian Husch Lee <[email protected]>

Update haystack/components/preprocessors/markdown_header_splitter.py

1fb1671

Co-authored-by: Sebastian Husch Lee <[email protected]>

Update haystack/components/preprocessors/markdown_header_splitter.py

4d166e6

Co-authored-by: Sebastian Husch Lee <[email protected]>

Update test/components/preprocessors/test_markdown_header_splitter.py

10be09d

Co-authored-by: Sebastian Husch Lee <[email protected]>

Merge branch 'main' into feature/md-header-splitter

42297d9
